Literature reviews typically focus on understanding a published work's vision, content, methods, results, and limitations. We instead aim to find meaningful information in the acknowledgment sections of research papers. The acknowledgment section appears in most research papers but, as far as we know, attracts little interest. We want to understand the different aspects of acknowledgment sections: how they are organized and whether, within a specific field, certain names and entities are mentioned frequently. In addition, we will discuss how these findings could be used to present helpful information to readers who use search engines to find related research.
Original Dataset
The original dataset of 64 papers was provided to us as a large JSON file containing a lot of data. For our analysis of acknowledgements sections, we only needed a few data points to get started. The original dataset is available below for exploration (with a minor change so that it renders nicely).
Show Code for Loading the Original Dataset
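from IPython.display import JSON
import json

# Load the full dataset as provided
with open("data/599_lit_review.json", "r") as open_f:
    original_dataset = json.load(open_f)

# Wrap the list in a dict so the JSON viewer renders it nicely
JSON({"data": original_dataset})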
Compiled Dataset
For our analysis, we only needed some metadata and a view or download link for each paper. We could then manually visit each link and copy any acknowledgements section into our dataset (we have some thoughts on how to automate this in a later section).
To extract the data we needed we ran the following code:
Show Code for Compile Dataset for Manual Addition
import pandas as pd

compiled_rows = []
for index, paper in enumerate(original_dataset):
    # Some papers have data from CSL and some from S2
    # Get both so we don't really have to care later on

    # Check if the paper has CSL data at all
    if paper.get("csl", None) is not None:
        # Find or get title and url returned by CSL data
        csl_title = paper["csl"].get("title", None)
        csl_url = paper["csl"].get("URL", None)
    else:
        csl_title = None
        csl_url = None

    # Check if the paper has Semantic Scholar data at all
    if paper.get("s2data", None) is not None:
        # Find or get title and url returned by S2 data
        s2_title = paper["s2data"].get("title", None)
        s2_url = paper["s2data"].get("url", None)
    else:
        s2_title = None
        s2_url = None

    # Compile all results
    compiled_rows.append({
        "paper_index": index,
        "doi": paper["doi"],
        "s2id": paper.get("s2id", None),
        "s2_url": s2_url,
        "csl_url": csl_url,
        "s2_title": s2_title,
        "csl_title": csl_title,
        "acknowledgements_text": None,
    })

compiled_dataset = pd.DataFrame(compiled_rows)
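Because each paper may carry a title and URL from CSL, from Semantic Scholar, or from both, it can be handy to pick whichever value is present when looking a paper up by hand. The following is a minimal sketch of that idea, not part of the original analysis: the helper name `first_available` and the toy record (including its DOI and URL) are hypothetical, and only the `csl`, `s2data`, `title`, `URL`, and `url` keys come from the code above.

```python
from typing import Any, Optional


def first_available(
    paper: dict[str, Any], source_keys: list[tuple[str, str]]
) -> Optional[str]:
    """Return the first non-empty value found among (source, field) pairs."""
    for source, field in source_keys:
        # A source may be missing entirely or present but set to None
        source_data = paper.get(source) or {}
        value = source_data.get(field)
        if value:
            return value
    return None


# Toy record shaped like the ones described above (all values are made up)
example_paper = {
    "doi": "10.0000/example",
    "csl": None,
    "s2data": {"title": "An Example Title", "url": "https://example.org/paper"},
}

# Prefer CSL, fall back to Semantic Scholar
title = first_available(example_paper, [("csl", "title"), ("s2data", "title")])
url = first_available(example_paper, [("csl", "URL"), ("s2data", "url")])
print(title, url)
```

Keeping both sources in the compiled dataset, as the original code does, preserves more information; a helper like this would only be a convenience at lookup time.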
Our dataset after adding all the acknowledgements sections is available below:
Read and Show Data with Acknowledgements Sections Added
from itables import show
import itables.options as table_opts

# Page-size choices offered by the interactive table
table_opts.lengthMenu = [5, 10, 25, 50]

raw_data = pd.read_csv("data/raw-ack-sections.csv")
show(raw_data)
We can now take each of these acknowledgements sections and run them through a named entity recognition model.
import spacy

nlp = spacy.load("en_core_web_trf")

# Filter dataset to only include rows with acknowledgements sections
filtered_data = raw_data.dropna(subset=["acknowledgements_text"])

# For each acknowledgement, run it through spacy,
# extract entities and their labels and store to a dataframe
entities_rows = []
docs = []
for _, paper in filtered_data.iterrows():
    doc = nlp(paper.acknowledgements_text)
    docs.append(doc)
    for ent in doc.ents:
        # Store with the DOI so we can join with other data later
        entities_rows.append({
            "doi": paper.doi,
            "entity": ent.text,
            "entity_label": ent.label_,
        })

entities = pd.DataFrame(entities_rows)
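# How did the model tag each of these examples?
from ipywidgets import interact
from IPython.display import display, HTML
from spacy import displacy


@interact
def render_example(doc_index=list(range(len(docs)))):
    # jupyter=False makes displacy return the markup as a string
    # instead of displaying it itself, so we can wrap it in HTML()
    markup = displacy.render(docs[doc_index], style="ent", jupyter=False)
    return display(HTML(markup))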
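One of the questions raised in the introduction is whether certain names and entities are mentioned frequently. With the extracted entities stored one row per mention, a first pass could count how many distinct papers mention each entity. The sketch below uses only the standard library and made-up rows in the same shape as `entities_rows`; the function name `papers_per_entity` and the sample values are hypothetical, and no real results are shown.

```python
from collections import Counter, defaultdict

# Hypothetical rows in the same shape as `entities_rows` (values are made up)
example_rows = [
    {"doi": "10.0000/a", "entity": "National Science Foundation", "entity_label": "ORG"},
    {"doi": "10.0000/a", "entity": "Jane Doe", "entity_label": "PERSON"},
    {"doi": "10.0000/b", "entity": "National Science Foundation", "entity_label": "ORG"},
    {"doi": "10.0000/b", "entity": "National Science Foundation", "entity_label": "ORG"},
]


def papers_per_entity(rows):
    """Count, for each (label, entity) pair, how many distinct papers mention it."""
    dois_by_entity = defaultdict(set)
    for row in rows:
        key = (row["entity_label"], row["entity"].strip())
        # A set ensures repeated mentions within one paper count once
        dois_by_entity[key].add(row["doi"])
    return Counter({key: len(dois) for key, dois in dois_by_entity.items()})


counts = papers_per_entity(example_rows)
for (label, entity), n_papers in counts.most_common(3):
    print(f"{label:<8}{entity:<32}{n_papers}")
```

Counting distinct papers rather than raw mentions keeps a single acknowledgement that repeats a funder's name from dominating the ranking. Exact string matching is a simplification: variants such as abbreviations would be counted as separate entities.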